One of the most important industry applications of modern Machine Learning techniques is in shape recognition problems. In this project, we attempt to classify a number of vehicles into different classes based on geometric features extracted from their silhouettes, viewed from different angles.
The purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
This data was originally gathered at the Turing Institute (TI) in 1986-87 by JP Siebert. It was partially financed by Barr and Stroud Ltd. The original purpose was to find a method of distinguishing 3D objects within a 2D image by applying an ensemble of shape feature extractors to the 2D silhouettes of the objects. Measures of shape features extracted from example silhouettes of the objects to be discriminated were used to generate a classification rule tree by means of computer induction.
This object recognition strategy was successfully used to discriminate between silhouettes of model cars, vans and buses viewed from constrained elevation but all angles of rotation.
The features were extracted from the silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS, which extracts a combination of scale independent features utilising both classical moments based measures such as scaled variance, skewness and kurtosis about the major/minor axes and heuristic measures such as hollows, circularity, rectangularity and compactness.
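As a rough illustration of what such moment-based measures look like, the sketch below computes scaled variance, skewness and kurtosis of a binary silhouette about one image axis. These are generic standardised central moments, not the exact BINATTS definitions (which are not reproduced here), and `axis_moments` is a hypothetical helper:

```python
import numpy as np

def axis_moments(silhouette):
    """Illustrative moment-based shape measures for a binary silhouette.

    NOTE: generic central moments about the vertical image axis, not the
    exact BINATTS formulas.
    """
    ys, xs = np.nonzero(silhouette)        # pixel coordinates of the shape
    cx = xs.mean()                         # centroid x-coordinate
    dx = xs - cx
    var = dx.var()                         # variance about the vertical axis
    std = np.sqrt(var)
    skewness = np.mean(dx**3) / std**3     # third standardised moment
    kurtosis = np.mean(dx**4) / std**4     # fourth standardised moment
    return var, skewness, kurtosis

# toy 5x5 silhouette: a symmetric 3x3 block, so skewness should be 0
sil = np.zeros((5, 5), dtype=int)
sil[1:4, 1:4] = 1
print(axis_moments(sil))
```

The same construction about the minor axis, or on mean-centred and scale-normalised coordinates, gives the scale-independent variants the paragraph above describes.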
Four "Corgi" model vehicles were used for the experiment: a double decker bus, Chevrolet van, Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.
The images were acquired by a camera looking downwards at the model vehicle from a fixed angle of elevation (34.2 degrees to the horizontal). The vehicles were placed on a diffuse backlit surface (lightbox) and painted matte black to minimise highlights. The images were captured using a CRS4000 framestore connected to a VAX 750. All images were captured with a spatial resolution of 128x128 pixels quantised to 64 greylevels. These images were thresholded to produce binary vehicle silhouettes, negated (to comply with the processing requirements of BINATTS) and thereafter subjected to shrink-expand-expand-shrink HIPS modules to remove "salt and pepper" image noise.
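The shrink-expand-expand-shrink sequence corresponds to morphological opening followed by closing. The HIPS modules themselves are not available here, but a rough analogue can be sketched with `scipy.ndimage` (assumed installed); `clean_silhouette` is a hypothetical helper, not the original code:

```python
import numpy as np
from scipy import ndimage

def clean_silhouette(binary_img):
    """Rough analogue of shrink-expand-expand-shrink: morphological opening
    (erode then dilate) removes isolated foreground specks, then closing
    (dilate then erode) fills small holes."""
    opened = ndimage.binary_opening(binary_img)
    return ndimage.binary_closing(opened)

# toy example: a solid square plus one isolated "salt" pixel
img = np.zeros((9, 9), dtype=bool)
img[2:7, 2:7] = True
img[0, 8] = True  # salt noise away from the shape
cleaned = clean_silhouette(img)
print(cleaned[0, 8], cleaned[4, 4])  # the speck is gone, the shape survives
```

Note that opening slightly rounds the corners of the shape; on 128x128 silhouettes this is negligible relative to removing the noise.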
The vehicles were rotated and their angle of orientation was measured using a radial graticule beneath the vehicle. 0 and 180 degrees corresponded to "head on" and "rear" views respectively while 90 and 270 corresponded to profiles in opposite directions. Two sets of 60 images, each set covering a full 360 degree rotation, were captured for each vehicle. The vehicle was rotated by a fixed angle between images. These datasets are known as e2 and e3 respectively.
A further two sets of images, e4 and e5, were captured with the camera at elevations of 37.5 degrees and 30.8 degrees respectively. These sets also contain 60 images per vehicle, apart from e4.van which contains only 46 owing to the difficulty of containing the van in the image at some orientations.
The vehicle.csv file contains the consolidated output from all measurements for all vehicles, which we will use as our input data.
The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object oriented at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon.
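The caliper idea above can be sketched numerically: project the silhouette's points onto a direction and take the extent, repeating at 5-degree increments. This is only an illustration of the concept (and `caliper_lengths` a hypothetical helper), not the HIPS/BINATTS implementation:

```python
import numpy as np

def caliper_lengths(points, step_deg=5):
    """Projected extent ('caliper' length) of a 2D point set at each
    orientation, sampled every step_deg degrees."""
    pts = np.asarray(points, dtype=float)
    lengths = []
    for deg in range(0, 180, step_deg):   # 180 degrees suffice by symmetry
        theta = np.radians(deg)
        direction = np.array([np.cos(theta), np.sin(theta)])
        proj = pts @ direction            # project each point onto the direction
        lengths.append(proj.max() - proj.min())
    return np.array(lengths)

# unit square: the maximum caliper length is the diagonal
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(caliper_lengths(square).max())  # ≈ 1.414 (the diagonal, at 45 degrees)
```

The maximum over all orientations gives the maximum length the text refers to, and the intersection of the caliper strips bounds the polygon whose area is reported.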
Apply a dimensionality reduction technique – PCA – and train a model using principal components instead of training the model on just the raw data.
● Exploratory Data Analysis
● Reduce the number of dimensions in the dataset with minimal information loss
● Train a model using Principal Components, and study model performance
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
pd.plotting.register_matplotlib_converters()
%matplotlib inline
plt.style.use('seaborn-whitegrid')
pd.set_option('display.max_columns', 500)
warnings.filterwarnings("ignore")
# Read the data:
data = pd.read_csv('vehicle.csv')
data.head()
# Let's Check the shape of the dataset
data.shape
data.info()
# Let's check which columns have missing values with missing data count per column
data.isnull().sum()
data.describe().T
Let's handle the missing values in order to be able to visualize the data and prepare it for further processing.
First let's explore the completeness of our dataset with the help of missingno. missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows us to get a quick visual summary of the completeness (or lack thereof) of our dataset.
# !pip install missingno
import missingno as msno
msno.matrix(data)
This shows that there are relatively very few datapoints missing, and we can safely drop a few rows which contain multiple missing values.
# dropping rows which have more than 1 null value
null_value_indexes = []
for i in range(len(data.index)):
    if data.iloc[i].isnull().sum() > 1:
        print("NaN in row ", i, " : ", data.iloc[i].isnull().sum())
        null_value_indexes.append(i)
print(f'\nDropping rows: {null_value_indexes}')
data.drop(null_value_indexes, inplace=True)
There are no missing values in our target column 'class'. We would like to replace the string values with numerical values for further processing.
# First let's check the distribution of the target attribute.
ax = sns.countplot(x="class", data=data, palette="pastel")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x() + 0.4, p.get_height() + 6), ha='center')
plt.show()
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data['class'] = labelencoder.fit_transform(data['class'])
data['class'].value_counts()
data.isnull().sum()
Now we will make a copy of our dataset for data visualization and use the median values to fill in the remaining missing values. In the actual data preprocessing, we will instead build a transformation pipeline that can be fitted on the train data and applied to the test data separately, so there is no information leakage that results in overfitting.
data2 = data.copy(deep = True)
# replacing missing values with the median of the respective column.
for col in data2.columns:
    data2[col].fillna(value=data2[col].median(), inplace=True)
data2.isnull().sum()
Let's explore the dataset visually to understand the nature, shape and distribution of the various attributes, find relationships between the independent variables, and carefully choose which attributes should be part of the analysis and why.
# Let's create a function that returns a Pie chart and a Bar Graph for categorical variables:
def cat_view(x='class'):
    """
    Function to create a Bar chart and a Pie chart for categorical variables.
    """
    from matplotlib import cm
    color1 = cm.inferno(np.linspace(.4, .8, 30))
    color2 = cm.viridis(np.linspace(.4, .8, 30))
    fig, ax = plt.subplots(1, 2, figsize=(16, 6))

    # Draw a Pie Chart on the first subplot.
    s = data.groupby(x).size()
    mydata_values = s.values.tolist()
    mydata_index = s.index.tolist()

    def func(pct, allvals):
        absolute = int(pct / 100. * np.sum(allvals))
        return "{:.1f}%\n({:d})".format(pct, absolute)

    wedges, texts, autotexts = ax[0].pie(mydata_values, autopct=lambda pct: func(pct, mydata_values),
                                         textprops=dict(color="w"))
    ax[0].legend(wedges, mydata_index,
                 title="Index",
                 loc="center left",
                 bbox_to_anchor=(1, 0, 0.5, 1))
    plt.setp(autotexts, size=12, weight="bold")
    ax[0].set_title(f'{x.capitalize()} Piechart')

    # Draw a Bar Graph on the second subplot.
    d = data[x].value_counts()
    splot = ax[1].bar(x=d.index, height=d.values)
    for p in splot.patches:
        ax[1].annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
                       ha='center', va='center', xytext=(0, 5), textcoords='offset points')
    # Add some text for labels, title and custom x-axis tick labels, etc.
    ax[1].set_ylabel('Count')
    ax[1].set_title(f'{x.capitalize()} Distribution Bar Graph')
    fig.tight_layout()
    plt.show()
cat_view('class')
pos = 1  # a variable to manage the position of the subplot in the overall plot
for feature in data.columns:  # iterate over every attribute whose distribution is to be visualized
    if pos == 1:
        plt.figure(figsize=(30, 20))  # set the figure size
    plt.subplot(3, 4, pos)  # 3x4 plot grid
    if feature != 'class':  # plot a histogram for the continuous columns
        sns.distplot(data2[feature], kde=True)
    else:
        sns.countplot(data2[feature])  # plot a bar chart for the categorical column
    pos += 1  # to plot over the grid one by one
    if pos > 12:  # start a new figure once the 3x4 grid is full
        pos = 1
peak_points = {
    'circularity': 'Total 2 peaks.',
    'distance_circularity': 'Total 2 peaks.',
    'radius_ratio': 'Total 2 peaks.',
    'max.length_aspect_ratio': 'Total 2 peaks.',
    'scatter_ratio': 'Total 2 peaks.',
    'pr.axis_rectangularity': 'Total 2 peaks.',
    'scaled_variance': 'Total 2 peaks.',
    'scaled_variance.1': 'Total 2 peaks.',
    'hollows_ratio': 'Total 2 peaks.',
}
from scipy.stats import skew
# skewness = 0 : normally distributed.
# skewness > 0 : more weight in the right tail of the distribution.
# skewness < 0 : more weight in the left tail of the distribution.
for col in data2.columns:
    skewness = skew(data2[col])
    if skewness == 0:
        label = 'Normally distributed.'
    elif skewness > 0:
        label = 'More weight in the right tail of the distribution.'  # right skewed
    else:
        label = 'More weight in the left tail of the distribution.'  # left skewed
    peak_label = peak_points.get(col, '')
    print(f'- Skewness of {col} is {skewness}. {label} {peak_label}\n')
plt.figure(figsize=(15,15))
sns.boxplot(data=data2, orient='h')
From the distributions and boxplots, we can see that the attributes 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'skewness_about' and 'scaled_radius_of_gyration' are heavily skewed, with a lot of outlier values. But how important are these features with respect to the target, and can we safely ignore them? We will explore this by studying feature importance in a later step.
from IPython.display import Image
# First, let us analyze pairwise correlation between different predictor attributes.
# sns_plot = sns.pairplot(data2, hue = 'class')
# sns_plot.savefig("pairplot.png")
# plt.clf()
Image(filename='pairplot.png')
# Let's study the Correlation Matrix
plt.figure(figsize = (20,15))
sns.set_style(style = 'white')
g = sns.heatmap(data2.corr(), annot=True, cmap = 'summer_r', square=True, linewidth=1, cbar_kws={'fraction' : 0.02})
g.set_yticklabels(g.get_yticklabels(), rotation=0, horizontalalignment='right')
bottom, top = g.get_ylim()
g.set_ylim(bottom + 0.5, top - 0.5)
# Create correlation matrix
corr_matrix = data2.corr().abs()
# Select the lower triangle of the correlation matrix
lower = corr_matrix.where(np.tril(np.ones(corr_matrix.shape), k=0).astype(bool))
mask = lower == 0 # to mask the upper triangle in the following heatmap
plt.figure(figsize = (15,8)) # setting the figure size
sns.set_style(style = 'white') # Setting it to white so that we do not see the grid lines
g = sns.heatmap(lower, center=0.5, cmap= 'summer_r', annot= True, xticklabels = corr_matrix.index,
yticklabels = corr_matrix.columns, cbar= False, linewidths= 1, mask = mask) # Da Heatmap
g.set_yticklabels(g.get_yticklabels(), rotation=0, horizontalalignment='right')
bottom, top = g.get_ylim()
g.set_ylim(bottom + 0.5, top - 0.5)
plt.xticks(rotation = 50) # Aesthetic purposes
plt.show()
# Find feature columns with a pairwise correlation greater than 0.95
# (exclude the diagonal, otherwise every column correlates perfectly with itself)
lower_no_diag = corr_matrix.where(np.tril(np.ones(corr_matrix.shape), k=-1).astype(bool))
to_drop = [column for column in lower_no_diag.columns if any(lower_no_diag[column] > 0.95)]
# Correlation with Target
predictors = data2.drop('class', axis=1)
predictors.corrwith(data2['class']).plot.bar(figsize = (14, 6), title = "Correlation with Target", fontsize = 12, grid = True)
In order to understand the importance of various features, let's quickly train a Random Forest classifier to make use of the inbuilt 'Feature Importance'.
X = data2.drop('class', axis = 1)
y = data2['class']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify = y)
X_train.shape
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rf_clf.fit(X_train, y_train)
features = list(X_train.columns)
importances = rf_clf.feature_importances_
indices = np.argsort(importances)
fig, ax = plt.subplots(figsize=(10, 7))
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
ax.tick_params(axis="x", labelsize=12)
ax.tick_params(axis="y", labelsize=12)
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance', fontsize = 18)
X = data.drop('class', axis = 1)
y = data['class']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify = y)
print(f'Shape of train data set: {X_train.shape}')
print(f'Shape of test data set: {X_test.shape}')
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
X_train_std = transformer.fit_transform(X_train)
y_train = y_train.to_numpy()
X_test_std = transformer.transform(X_test)  # transform only: the pipeline is fitted on the train set to avoid leakage
y_test = y_test.to_numpy()
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
# train the model on train set
model = SVC()
model.fit(X_train_std, y_train)
# print prediction results
y_pred = model.predict(X_test_std)
print(classification_report(y_test, y_pred))
svc_score = model.score(X_test_std, y_test)
print(f'SVM accuracy: {svc_score}')
from sklearn import metrics
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))
from sklearn.model_selection import GridSearchCV
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf', 'poly', 'sigmoid']}
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 1)
# fitting the model for grid search
grid.fit(X_train_std, y_train)
grid.best_params_
svm_clf = grid.best_estimator_
from sklearn import model_selection
svc_cross_val_score = model_selection.cross_val_score(svm_clf, X_train_std, y_train, cv=3, scoring='accuracy')
print(f'Cross validation score: {svc_cross_val_score}')
# Accuracy on the test set
svc_score = svm_clf.score(X_test_std, y_test)
print(f'SVM Test accuracy: {svc_score}')
from sklearn.decomposition import PCA
pca = PCA()
pca_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('pca', PCA())])
X_pca = pca_transformer.fit_transform(X)
print(pca_transformer.named_steps["pca"].explained_variance_ratio_)
explained_variance_ratio = pca_transformer.named_steps["pca"].explained_variance_ratio_
plt.figure(figsize= (20,15))
plt.subplot(3, 2, 1)
plt.bar(list(range(1,19)), explained_variance_ratio, alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('eigen Value')
plt.subplot(3, 2, 2)
plt.step(list(range(1,19)), np.cumsum(explained_variance_ratio), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('eigen Value')
plt.show()
Eleven dimensions seems very reasonable: with 11 principal components we can explain over 99% of the variation in the original data.
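Rather than reading the component count off the cumulative plot, scikit-learn can pick it automatically: passing a float between 0 and 1 as `n_components` keeps the smallest number of components explaining at least that fraction of variance. A small sketch on synthetic data (not the vehicle dataset; `X_demo` and the shapes are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# synthetic data: 3 informative directions mixed into 10 dimensions, plus tiny noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

pca = PCA(n_components=0.99)   # keep enough components for >= 99% of the variance
pca.fit(X_demo)
print(pca.n_components_)       # 3 components suffice here
```

Applying the same idea to our pipeline, `PCA(n_components=0.99)` should select the 11 components we read off the cumulative variance plot.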
pca = PCA(2) # project from 18 to 2 dimensions
X_pca = pca.fit_transform(X_train_std)
print(X_train_std.shape)
print(X_pca.shape)
plt.figure(figsize=(14,10))
plt.scatter(X_pca[:, 0], X_pca[:, 1],
c=y_train, edgecolor='none', alpha=0.8,
cmap=plt.cm.get_cmap('rainbow', 10))
plt.xlabel('component 1', fontsize = 14)
plt.ylabel('component 2', fontsize = 14)
plt.colorbar();
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_train_std)
plt.figure(figsize=(14,10))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
c=y_train, edgecolor='none', alpha=0.8,
cmap=plt.cm.get_cmap('jet', 10))
plt.xlabel('component 1', fontsize = 14)
plt.ylabel('component 2', fontsize = 14)
plt.colorbar()
plt.show()
pca_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('pca', PCA(n_components=11))])
# split first, then fit the pipeline on the training data only, to avoid information leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
X_train_pca = pca_transformer.fit_transform(X_train)
X_test_pca = pca_transformer.transform(X_test)
print(f'Shape of new training data: {X_train_pca.shape}')
print(f'Shape of new test data: {X_test_pca.shape}')
Let's train a Support vector machine using the train set and get the accuracy on the test set.
param_grid = {'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf', 'poly', 'sigmoid']}
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 1)
# fitting the model for grid search
grid.fit(X_train_pca, y_train)
svm_clf2 = grid.best_estimator_
pca_svc_cross_val_score = model_selection.cross_val_score(svm_clf2, X_train_pca, y_train, cv=3, scoring='accuracy')
print(f'Cross validation score: {pca_svc_cross_val_score}')
# Accuracy on the test set
pca_svc_score = svm_clf2.score(X_test_pca, y_test)
print(f'SVM Test accuracy: {pca_svc_score}')
print(f'\nSVM score: {svc_score}')
print(f'Cross validation score: {svc_cross_val_score}\n')
print(f'SVM Score after PCA: {pca_svc_score}')
print(f'Cross validation score after PCA: {pca_svc_cross_val_score}')
Looks like by reducing dimensionality by 7, we only dropped around 1% in accuracy! Because the PCA-transformed dataset is significantly more efficient in terms of time and computational resources, the loss of information and slight drop in model performance is an acceptable trade-off here.
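The efficiency claim can be illustrated with a quick timing sketch on synthetic data shaped like ours (18 features, 4 classes; `X_demo` and the sample sizes are made up, and absolute timings will vary by machine and may be close on data this small):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# synthetic stand-in for the vehicle data: 18 features, 4 classes
X_demo, y_demo = make_classification(n_samples=800, n_features=18,
                                     n_informative=10, n_classes=4,
                                     random_state=42)
X_reduced = PCA(n_components=11).fit_transform(X_demo)

timings = {}
for name, X_in in [('raw (18 features)', X_demo), ('pca (11 features)', X_reduced)]:
    start = time.perf_counter()
    SVC().fit(X_in, y_demo)               # time a single SVC fit on each version
    timings[name] = time.perf_counter() - start
    print(f'{name}: fit in {timings[name]:.4f}s')
```

The gap widens with more samples and wider grids of hyperparameters, which is where the 18-to-11 reduction pays off in practice.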